test(perf): add benchmark for jest runner #2618
Conversation
@Lakitna Still keeping high a.t.m. 😅 This adds a performance test for a jest project. A big one named lighthouse. I want to focus on the jest runner in the near future, so I thought it made sense to add a benchmark first. I still agree with the benefits of "dropping down" as specified in #2434. I might add some lower-level benchmark tests when I implement the improvements.
Sounds good :) Let me know when you're ready for a comprehensive baseline on multiple concurrencies.
It runs on GH actions now (took some time, because I had to use
It only mutates "lighthouse-core/audits/**/*.js", because it already takes 2h 18min on GH actions (that's a
This is what the report looks like:
Yes, will do.
@Lakitna it took me some work, but I think it's ready for benchmarking. You can run locally with
Here is a job that runs all performance tests: https://github.com/stryker-mutator/stryker/runs/1420244888?check_suite_focus=true
Results with gh workflow:
```diff
-    "noImplicitAny": true,
-    "noImplicitReturns": true,
-    "noImplicitThis": true,
+    "noImplicitAny": false,
```
is it necessary to switch off these options?
Otherwise I think it is good :)
yeah, not really important since Stryker disables type checking. I tried to run the tests locally and then the compilation failed. The reason for that is probably that it is really old TypeScript code (stems from angular 4 days).
I tried to run, but I got a short runtime and a mutation score of 0%... Something is going wrong here. Any ideas what's happening? @ commit: c7d1ea2 Edit: Could it maybe be the plugin stuff? I'm on Windows after all.
Update: Update: I keep updating it seems. Turns out I introduced an error in the wrong test o.0 Initial test run does fail when it should.
Might be an issue. I can try on windows as well in a few hours.
Hmm I've got the same result as you did @Lakitna. I'm pretty sure this is related to the way jest works on windows in combination with running the jest-runner from a different directory. Will look into it more, would be great if we can run the perf tests on windows as well.
Works now since #2623 Hope that didn't break anything for others 🤷♂️ |
What could possibly go wrong 🤷♂️
I'm running concurrency 15, 12, 8, and 4 right now. I think that should do it. I am doing it on my desktop this time though. With the long runtimes, it's easier. It's still an 8 core 16 thread CPU, this time a Ryzen 3700X. Just note that it might cause slight differences with the previous Express bench results.

Update: Only slightly related result. The memory thing I mentioned before is very apparent with this test suite. I still think it's not a performance issue, but it is notable.

Whelp, I spoke too soon. This most definitely is a performance issue. You're looking at CPU slowdowns because of a lack of memory. Probably an issue with the test suite, not Stryker.
Maybe add
Here are the results of this night's run. Ran at 1416611. I made a mistake causing all runs to be at concurrency 15... Almost as if people are not very perceptive at night 🤭 On the bright side, we can see how stable default concurrency is.
All in all, it looks to be pretty stable. There are some differences, but they are within 0.1 percentage point. I'm running different concurrencies now. This time for real, I double-checked.
Here are the results we actually need:
Durations and tests/mutants are interesting here. I'll run some missing concurrencies so I can make a duration graph as I did for Express. First impressions suggest a tipping point somewhere between concurrency 8 and 4. To that end, I'll run 7, 6, and 5 to fill in the gaps. Scores seem very stable. Timeouts are manageable. All in all, it seems to be a lot more stable compared to the Express bench.
The results are in (previous comment), and stability is great! :) On all metrics but runtime... The graph is the duration per concurrency. The concurrency 4 run is just such a weird outlier. It makes me think there was an issue during the concurrency-4 run or something. I'll run 4 and 3 to get some more data points.
Wow! This is amazing. Thanks so much for taking the time to run this. Do you want me to put this graph in the readme? Then we should update it once I've implemented some improvements 😅
We can definitely use the data here to find out how much of a performance delta there is between changes :) It would also be neat to show the basic relationship between concurrency and runtime. If you're interested, I have a similar graph for the Mocha runner in the Express bench. It has more data points but shows the same relation between the two metrics.

Edit: It would also be neat to make a similar graph for the relation between mutant count and duration. However, that would require a pretty specialized test setup. Currently, I do not have the time to create such a setup.
Do you mean that it would require a lot of scripting to automate it? That's true, and it would take a dedicated server to run. Right now we're using the free GH actions hardware, that won't do. |
Hmm, not necessarily specialized hardware. We can run it on my machine as far as I'm concerned. Unless you want to automatically test for performance regression. What I would like for this is a setup where we can test the same code with a variable number of mutations. To do that, however, we need a variable-sized codebase with a variable amount of tests tied to it. All so we can simulate a project growing. The metric for growth would be mutation count. It's basically to find out how Stryker scales with the codebase it tests. If we want to measure how Stryker handles growth, all other variables must scale linearly. Imagine results like:
It's not a simple implementation o.0 And we might even want to make it more complex by adding the mutation score variable:
I started a "funny, small experiment" about these tables. With simple code:

```ts
const vals = [];
for (let i = 0; i < 10000; i++) {
  vals.push(`export const test${i} = (a: number, b: number) => {
  return a + b;
};
`);
}
```

I am generating 20000 mutants, so I can make 100/1000/10000/100000 mutants and check the speed :)

EDIT: actually with 20000 mutants in 1 file (10000 test functions and tests), I managed to crash VSC 🗡️
That's a great start, but it also shows how many variables there are :) In an ideal world, you would isolate a single variable for these kinds of tests. That will take quite a lot of effort I'm afraid. That being said. I'm interested in running this to see what kind of results we get. Can you share it in such a way that I can set the number of mutations on the command line? (e.g.
Awesome, I'd also be interested to see what makes it crash! Lack of RAM I assume. It'd be interesting to find out what that means for large codebases. And (Typescript) codebases that bundle during transpilation? |
Yea, I am going to try single source/test file 100-500-1000-2500-5000-10000-20000, multifile - 100 mutants/tests per file, and random size - basically Math.random() with some tweaks :P
Sure, actually I'm thinking of making a repository for it so I could run everything at once HEHEHEHEHEHE
Yea, it seems that if you run it from VSC, the VSC process gets more and more RAM usage (an actual leak, I think XD) [I got over 8 GB at the crash point], but from Git I managed to get a normal run without any significant RAM overflow: 11 runs, up to 400 MB each
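To put numbers on observations like this, a small logger can be dropped into the run; this is just a sketch (the 5-second interval is an arbitrary choice), using Node's built-in `process.memoryUsage()`:

```javascript
// Format a byte count as whole megabytes, e.g. the ~400 MB seen above.
function formatRss(bytes) {
  return `${(bytes / 1024 / 1024).toFixed(0)} MB`;
}

// Periodically log the resident set size of the current process.
const timer = setInterval(() => {
  console.log(`rss: ${formatRss(process.memoryUsage().rss)}`);
}, 5000);
timer.unref(); // don't keep the process alive just for the logger
```

Comparing the logged RSS over a run from VS Code versus a plain terminal would show whether the growth sits in the test process itself or in the editor's wrapper process.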
It'd be great if we can make a scatterplot with a trendline for those results. Like the one above.